Flexible results pipeline for processing element

ABSTRACT

A flexible results pipeline for a processing element of a parallel processor is described. A plurality of result registers are selectively connected to each other, to processing logic of the processing element and to a neighbourhood connection register configured to receive data from and send data to other processing elements. The connections between the result registers and between the result registers and the neighbourhood connection register are selectively configurable by applied control signals.

FIELD OF THE INVENTION

[0001] The present invention relates to transferring data between components of a processing element in a parallel processor. More particularly, the present invention relates to transferring data between processing logic in the processing element and the inputs and outputs of the processing element.

BACKGROUND TO THE INVENTION

[0002] A simple computer generally includes a central processing unit (CPU) and a main memory. The CPU implements a sequence of operations encoded in a stored program. The program and the data on which the CPU acts are typically stored in the main memory. The processing of the program and the allocation of main memory and other resources are controlled by an operating system. In operating systems where multiple applications may share and partition resources, the processing performance of the computer can be improved through use of active memory.

[0003] Active memory is memory that processes data as well as storing it. It can be instructed to operate on its contents without transferring its contents to the CPU or to any other part of the system. This is typically achieved by distributing parallel processors throughout the memory. Each parallel processor is connected to the memory and operates on it independently of the others. Most of the data processing is performed within the active memory and the work of the CPU is thus reduced to the operating system tasks of scheduling processes and allocating system resources.

[0004] A block of active memory typically consists of the following: a block of memory, e.g. dynamic random access memory (DRAM), an interconnection block, and a memory processor (processing element array). The interconnection block provides a path that allows data to flow between the block of memory and the processing element array. The processing element array typically includes multiple identical processing elements controlled by a sequencer. Processing elements are generally small in area, have a low degree of hardware complexity, and are quick to implement, which leads to increased optimisation. Processing elements are usually designed to balance performance and cost. A simple, more general-purpose processing element will result in a higher level of performance than a more complex processing element because it can be easily coupled to generate many identical processing elements. Further, because of its simplicity, the processing element will clock at a faster rate.

[0005] In any computer system, it is important that data is processed efficiently in order to maximise the speed of the processor. In a parallel processor containing a plurality of processing elements, it is important to maximise the speed of movement of data from an input to the processing element through processing logic to an output of the processing element.

[0006] Moreover, it is important to ensure that data generated by one part of the processing element is ready for use by another part or by another processing element as and when it is required.

[0007] In a parallel processor, in which there is a plurality of processing elements, in addition to transferring data between a particular processing element and its memory or host CPU, often data is transferred between the individual processing elements. This further increases the complexity of inputting and outputting data from the processing element and can further reduce the speed of the processing element.

[0008] Accordingly, it is an object of the present invention to provide efficient scheduling and transfer of data within the processing element.

[0009] It is a further object of the present invention to provide a more flexible processing element, within which data can be efficiently transferred between components of the processing element.

[0010] It is yet a further object of the present invention to provide faster transfer out of the processing element of results of processing operations occurring therein.

SUMMARY OF THE INVENTION

[0011] Accordingly, the present invention provides a processing element for a parallel processor comprising:

[0012] processing logic; and

[0013] a plurality of result registers selectively connected to each other;

[0014] wherein:

[0015] at least one of the result registers is selectively connected to receive data from the processing logic;

[0016] at least one of the result registers is selectively connected to send data to the processing logic; and

[0017] the connections between the result registers are selectively configurable by applied control signals.

[0018] Preferably, the processing element further comprises:

[0019] a neighbourhood connection register configured to receive data from and send data to other processing elements in the device;

[0020] wherein:

[0021] the neighbourhood connection register is selectively connected to receive data from at least one of the result registers;

[0022] the neighbourhood connection register is selectively connected to send data to at least one of the result registers; and

[0023] the connections between the result registers and the neighbourhood connection register are selectively configurable by applied control signals.

[0024] Thus, the length and configuration of the result register pipeline can be changed. This provides for more flexible processing of data. Moreover, the position of the neighbourhood connection register in a chain comprising result registers and the neighbourhood connection register can be changed. This provides for more efficient and flexible transfer of data between the neighbourhood connection register and the processing logic (i.e. inputting of operands into the processing logic received from neighbouring processing elements or outputting results of the processing operations to neighbouring processing elements).

[0025] In one embodiment of the present invention, the processing element further comprises a register file configured to transfer data between the processing element and memory and/or a host connected to the device, wherein at least one of the result registers is selectively connected to receive data from the register file and at least one of the result registers is selectively connected to send data to the register file.

[0026] In another embodiment, the neighbourhood connection register is selectively connected to receive data from its own output. This way, the neighbourhood connection register can be used to store data between data transfers in the processing element.

[0027] Preferably, the processing element further comprises:

[0028] a control circuit which receives and decodes control commands transmitted to the processing element and generates the control signals;

[0029] at the input to each result register and the neighbourhood connection register, a selection circuit connected to the control logic for selecting the input to each result register and the neighbourhood connection register according to the control signals.

[0030] Preferably, the selection circuit is a multiplexer.

[0031] Preferably, the configuration of the connections between the result registers and the neighbourhood connection registers can be set such that data enters the result registers from different portions of the processing logic, enabling pipelining of processing operations in the processing logic.

[0032] Advantageously, pipelining allows results of operations from certain portions of the processing logic, which are complete before operations from other portions, to be output from the processing element. This increases the speed at which data can be output from the processing element. In addition, results which are available before other results can be easily fed back into the processing logic, thereby increasing the speed of the processing operations. Alternatively, results may be delayed until other results with which they are to be combined become available.

[0033] In a second aspect of the present invention, there is provided a method of configuring a processing element for a parallel processor, in which there is provided processing logic and a plurality of result registers, at least one of which is connected to the processing logic, comprising the steps of:

[0034] receiving control signals from a control circuit in the processing element; and

[0035] changing the configuration of the connections between the result registers accordingly.

[0036] In a third aspect of the present invention, there is provided a method of transferring data in a processing element for an active memory device, in which there is provided processing logic, a plurality of result registers, at least one of which is connected to the processing logic, and a neighbourhood connection register configured to receive data from and send data to other processing elements in the device and connected to at least one of the result registers, comprising the steps of:

[0037] (a) transferring data between the processing logic and the at least one result register connected to the processing logic;

[0038] (b) transferring data between the at least one result register connected to the neighbourhood connection register and the neighbourhood connection register; and

[0039] (c) changing the configuration of the connections between the neighbourhood connection register and the result registers.

[0040] Preferably, the method further comprises the step of repeating steps (a) to (c).

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] A specific embodiment will now be described by way of example only and with reference to the accompanying drawings, in which:

[0042] FIG. 1 shows one embodiment of an active memory block in accordance with the present invention;

[0043] FIG. 2 shows one embodiment of the components and interconnections of a processing element of the present invention;

[0044] FIG. 3 shows one embodiment of the components and interconnections of a register pipe of the present invention.

DETAILED DESCRIPTION

[0045] Referring to FIG. 1, one embodiment of an active memory block in accordance with the invention is shown. Active memory block 100 includes a memory 106 and an array 110 of processing elements. Memory 106 is preferably random access memory (RAM), in particular dynamic RAM (DRAM). The array 110 can communicate with memory 106 via an interconnection block 108. The interconnection block 108 can be any suitable communications path, such as a bi-directional high memory bandwidth path. A central processing unit (CPU) 102 can communicate with active memory block 100 via a communications path 104. The communications path 104 may be any suitable bi-directional path capable of transmitting data.

[0046] Referring to FIG. 2, the components of one of a number of processing elements 200 forming the array 110 are shown. The processing element 200 includes processing logic 204, a result pipe 201 including result registers 202 and a neighbourhood connection register 203. The result pipe 201 is connected to a DRAM interface 210 via a register file 208. Data is passed between the memory 106 and the processing element 200 via the DRAM interface 210 and the register file 208. Data is passed from the result registers 202 to the processing logic 204 to be processed. The processing logic 204 passes the results of processing back to the result registers 202. Data from neighbouring processing elements 250 is received via input logic 206 into the neighbourhood connection register 203. Data is output to neighbouring processing elements 250 directly from the neighbourhood connection register 203 or from output logic 208 which may combine the data being output with data from other neighbouring processing elements 250.

[0047] The processing logic 204 may comprise a number of different portions (not shown) into which data can be input and data can be output separately. These portions can include an arithmetic logic unit, a corresponding logical unit, shift control registers, condition registers and data shifting blocks.

[0048] Control logic 212 is connected to the DRAM interface 210, the register file 208 and the result pipe 201. The control logic 212 receives control commands sent to all of the processing elements in the array 110 and generates control signals 218 which are sent to the result pipe 201 to configure the connections between the result registers 202, the neighbourhood connection register 203 and the components connected to them, i.e. the register file 208, processing logic 204, output logic 208 and input logic 206.
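
By way of illustration only, the decoding of a control command into control signals might be modelled in software as follows. The 8-bit command format, the allocation of two select bits to each of four registers, and the names `decode_command`, `R0`, `R1`, `R2` and `X` used as dictionary keys are assumptions made for this sketch; the described embodiment does not specify a command encoding.

```python
def decode_command(command):
    """Split a hypothetical 8-bit command into four 2-bit select signals.

    Assumed layout (not taken from the embodiment): bits 0-1 select the
    input of R0, bits 2-3 of R1, bits 4-5 of R2 and bits 6-7 of X.
    """
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in 8 bits")
    names = ("R0", "R1", "R2", "X")
    # Shift each 2-bit field down to bit 0 and mask off the rest.
    return {name: (command >> (2 * i)) & 0b11 for i, name in enumerate(names)}


signals = decode_command(0b11_10_01_00)
print(signals)
```

In this sketch the same command word would be decoded identically in every processing element, which mirrors the statement that control commands are sent to all of the processing elements in the array 110.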

[0049] The result pipe is connected to the processing logic 204 via processing logic output and input interconnects 271, 272, to the register file 208 via register file output and input interconnects 291, 292, to the output logic via output interconnect 281, and to the input logic via input interconnect 282. The interconnects are 8-bit (byte) wide data wires between the components of the processing element 200.

[0050] Referring to FIG. 3, the components of the result pipe 201 are shown. The result pipe comprises result registers 202 (first, second and third result registers 310, 311, 312) and the neighbourhood connection register 203. At the input to each of the result registers 202 are first, second and third selection circuits 320, 321, 322 connected to the first, second and third result registers 310, 311, 312 respectively. There is a neighbourhood connection register selection circuit 324 connected at the input to the neighbourhood connection register 203. The selection circuits each select one of four inputs applied to them for a given configuration of control signals 218 and may comprise 8-bit 4:1 multiplexers, as shown.
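
By way of illustration only, a register fed by an 8-bit 4:1 multiplexer might be modelled in software as follows. The class name `MuxedRegister`, the method name `clock` and the representation of the control signal as an integer select value from 0 to 3 are assumptions made for this sketch and do not appear in the described embodiment.

```python
from dataclasses import dataclass


@dataclass
class MuxedRegister:
    """An 8-bit register whose input is chosen by a 4:1 multiplexer."""

    value: int = 0

    def clock(self, inputs, select):
        """Latch the one input chosen by the select signal (assumed 0..3)."""
        if len(inputs) != 4:
            raise ValueError("a 4:1 multiplexer needs exactly four inputs")
        if not 0 <= select <= 3:
            raise ValueError("select must be in the range 0..3")
        # Keep only the low 8 bits, matching the byte-wide interconnects.
        self.value = inputs[select] & 0xFF
        return self.value


reg = MuxedRegister()
reg.clock([0x11, 0x22, 0x33, 0x44], select=2)
print(hex(reg.value))  # prints 0x33
```

The three unselected inputs are simply ignored on that clock, which is the behaviour that lets the control signals 218 decide which component drives each register.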

[0051] The inputs to and outputs from each of the selection circuits are given in Table 1 below:

TABLE 1: Result and neighbourhood connection register inputs/outputs

Register   Input 0   Input 1   Input 2   Input 3   Outputs
R0         PL0       PL1                           RF, X, PL
R1         R0        X         RF        PL2       R2, R3, PL
R2         R1        X         RF        PL3       X, PL
X          R2        R1        R0        IL        OL, R1, R2, X

[0052] As can be seen in FIG. 3 and from Table 1, the only input to the register file 208 is from the first result register 310.

[0053] As mentioned above, data can be input into the result registers 202 from different portions of the processing logic 204. Such portions include an arithmetic logic unit PL0, a corresponding logical unit PL1, shift control registers PL2 and condition registers PL3. Generally, data could be output from each of the result registers 202 to the data shifting blocks (mentioned above).

[0054] The use of the selection circuits 321, 322, 324 allows the result and neighbourhood connection registers 202, 203 to be chained together in different configurations. Possible configurations are:

R0 → R1 → R2 → X
R0 → X → R1 → R2
R0 → R1 → X → R2
R0 → X → R2
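
By way of illustration only, the effect of selecting one of these chain configurations might be modelled in software as follows. The function name `shift_chain`, the dictionary representation of register contents and the assumption that a register left out of the chain holds its value are all hypothetical and are not taken from the described embodiment.

```python
def shift_chain(values, chain, new_head):
    """Advance one clock: each register takes its predecessor's old value.

    `values` maps register names to contents, `chain` lists the registers
    in order, and `new_head` is the value entering the first register.
    Registers not named in `chain` are assumed to hold their contents.
    """
    updated = dict(values)
    previous = new_head
    for name in chain:
        updated[name] = previous
        previous = values[name]  # old contents feed the next register
    return updated


state = {"R0": 1, "R1": 2, "R2": 3, "X": 9}

# Configuration R0 -> X -> R1 -> R2
print(shift_chain(state, ["R0", "X", "R1", "R2"], new_head=7))

# Configuration R0 -> X -> R2, with R1 left out of the chain
print(shift_chain(state, ["R0", "X", "R2"], new_head=7))
```

Under these assumptions the same starting contents end up in different registers depending only on the chosen chain, which is the sense in which the position of the neighbourhood connection register in the chain can be moved.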

[0055] Thus, data can be input to the neighbourhood connection register 203 from neighbouring processing elements 250, the configuration of the chain can be changed so that the neighbourhood connection register 203 is moved to a different location and the data therein output to the second or third result register 311, 312 having a desired output destination (i.e. a desired portion of the processing logic or register file).

[0056] The chain also allows pipelining of data to take place in the processing logic 204 and between the processing logic 204 and the register file 208. As will be appreciated, the results of some processing operations are available before results from other processing operations. Using the flexible results pipeline described, the results of processing operations can be extracted from a given portion of the processing logic 204 before results from other portions. This extracted data can then be output from the result pipe 201, either to the register file 208 or to the output logic 208 so that it can be output from the processing element 200 before the results from the other portions are available. In addition, the chain allows one or more results of a first processing operation which are available before the entire first operation has completed to be fed back into the processing logic 204 to be used in a second operation whilst the first operation completes. Moreover, it allows results to be delayed whilst other results or data with which they are to be combined are made available.
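
By way of illustration only, the delaying of an early result until a later result is available might be modelled in software as follows. The function name `delay_line`, the use of `None` to denote a register that has not yet received data and the choice of a two-clock delay are assumptions made for this sketch.

```python
from collections import deque


def delay_line(samples, stages):
    """Pass samples through `stages` chained registers.

    Each sample appears at the output `stages` clocks after it was input;
    `None` marks clocks on which no data has yet reached the output.
    """
    if stages == 0:
        return list(samples)
    pipe = deque([None] * stages, maxlen=stages)
    out = []
    for sample in samples:
        out.append(pipe[0])   # oldest register drives the output
        pipe.append(sample)   # new data enters, the oldest entry drops out
    return out


early = [10, 20, 30, 40]               # results ready from clock 0
late = [None, None, 1, 2]              # results ready two clocks later
aligned = delay_line(early, stages=2)  # [None, None, 10, 20]

sums = [a + b for a, b in zip(aligned, late) if a is not None]
print(sums)  # prints [11, 22]
```

Changing `stages` corresponds, in this simplified model, to changing the number of registers through which a result passes before it is combined with another result.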

[0057] In conclusion, the present invention allows data processing in processing elements in a parallel processor to occur at a higher rate. Data can be processed and output at a higher rate from the processing elements since pipelining can occur. The flexible positioning of the neighbourhood connection register 203 within the result pipe 201 helps facilitate this.

[0058] It will of course be understood that the present invention has been described above purely by way of example and modifications of detail can be made within the scope of the invention.

1. A processing element for a parallel processor comprising: processing logic; and a plurality of result registers selectively connected to each other; wherein: at least one of the result registers is selectively connected to receive data from the processing logic; at least one of the result registers is selectively connected to send data to the processing logic; and the connections between the result registers are selectively configurable by applied control signals.
2. The processing element of claim 1, further comprising a register file configured to transfer data between the processing element and memory and/or a host connected to the device, wherein at least one of the result registers is selectively connected to receive data from the register file and at least one of the result registers is selectively connected to send data to the register file.
3. A processing element according to claim 1 or claim 2 further comprising: a neighbourhood connection register configured to receive data from and send data to other processing elements in the device; wherein: the neighbourhood connection register is selectively connected to receive data from at least one of the result registers; the neighbourhood connection register is selectively connected to send data to at least one of the result registers; and the connections between the result registers and the neighbourhood connection register are selectively configurable by applied control signals.
4. The processing element of claim 3, wherein the neighbourhood connection register is selectively connected to its own output.
5. The processing element of any preceding claim, further comprising: a control circuit which receives and decodes control commands transmitted to the processing element and generates the control signals; and at the input to each result register and, as the case may be, the neighbourhood connection register, a selection circuit connected to the control logic for selecting the input to the register according to the control signals.
6. The processing element of claim 5, wherein the selection circuit is a multiplexer.
7. The processing element of any preceding claim, wherein the configuration of the connections between the result registers and, as the case may be, the neighbourhood connection registers can be set such that data enters and/or exits the result registers from different portions of the processing logic to enable pipelining of processing operations in the processing logic.
8. A method of configuring a processing element for a parallel processor, in which there is provided processing logic and a plurality of result registers, at least one of which is connected to the processing logic, comprising the steps of: receiving control signals from a control circuit in the processing element; and changing the configuration of the connections between the result registers accordingly.
9. A method according to claim 8 in which in the processing element there is further provided a neighbourhood connection register configured to receive data from and send data to other processing elements in the device and connected to at least one of the result registers, the method further comprising the step of: changing the configuration of the connections between the neighbourhood connection register and the result registers according to the received control signals.
10. A method of transferring data in a processing element for an active memory device, in which there is provided processing logic, a plurality of result registers, at least one of which is connected to the processing logic, and a neighbourhood connection register configured to receive data from and send data to other processing elements in the device and connected to at least one of the result registers, comprising the steps of: (a) transferring data between the processing logic and the at least one result register connected to the processing logic; (b) transferring data between the at least one result register connected to the neighbourhood connection register and the neighbourhood connection register; and (c) changing the configuration of the connections between the neighbourhood connection register and the result registers.
11. The method of claim 10, further comprising the step of repeating steps (a) to (c).